This vignette demonstrates the main workflow of the CFSTRenD package for manipulating, analyzing, and visualizing tree-ring data and model performance.
The workflow includes:
Data loading, to import raw measurements and prepare them for analysis.
Data processing, which organizes and formats the data for subsequent steps.
Quality assessment, ensuring the reliability and consistency of the measurements.
Modeling, where growth trends and patterns are analyzed.
As illustrated in the following diagram, visual inspection is provided at each step to facilitate assessment of data quality and validation of results.
library(CFSTRenD)
library(data.table)  # provides fread()
# Example: load a dataset included in the package
# ring measurement
samples69.o <- fread(system.file("extdata", "samples69.csv", package = "CFSTRenD"))
# climate
clim69 <- fread(system.file("extdata", "clim69.csv", package = "CFSTRenD"))
# format the user's data to conform to the CFSTRenD structure
samples69.trt <- CFS_format(data = list(samples69.o, 39:140), usage = 1, out.csv = NULL)
#> [1] "you have filled all the mandatory information"
#> [1] "you have filled all the un-mandatory information"
class(samples69.trt)
#> [1] "cfs_format"
data
All information should be provided in a single file in wide format, with metadata first, followed by the ring-width measurements (in mm).
The column names for the ring-width measurements can follow two formats to indicate the year of measurement:
Directly as the year (e.g., 1980)
Prefixed with a character (e.g., X1980)
It is highly recommended that the ring-width measurement columns are ordered by year and consecutive in the dataset, as the column indices will be used as input for the function CFS_format().
In data = list(samples69.o, 39:140), the first item is the input table and the second item gives the column indices of the ring-width measurements.
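The column indices can be derived from the column names rather than counted by hand. The following is a minimal base R sketch, assuming a toy table (the object `dat` and its columns are hypothetical, not the package data) whose ring-width columns are named as a four-digit year with an optional one-character prefix. It also checks that the columns are contiguous and the years consecutive, as recommended above.

```r
# Toy wide table (hypothetical): two metadata columns, then ring widths in mm.
dat <- data.frame(
  site_id = c("A", "B"),
  species = c("PSEUMEN", "PSEUMEN"),
  X1980 = c(1.2, 0.9),
  X1981 = c(1.1, 1.0),
  X1982 = c(1.3, 0.8),
  check.names = FALSE
)

# Columns named as a 4-digit year, optionally prefixed with one letter.
rw_idx <- grep("^[A-Za-z]?[0-9]{4}$", names(dat))
rw_years <- as.integer(sub("^[A-Za-z]", "", names(dat)[rw_idx]))

# An index range is only safe to pass on if the columns sit next to each
# other and the years increase by exactly one.
cols_contiguous <- all(diff(rw_idx) == 1)
years_consecutive <- all(diff(rw_years) == 1)
```

If both checks are TRUE, `range(rw_idx)` gives the first and last index of the ring-width block.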
usage
If users intend to submit their data to the CFSTRenD online repository, set usage = 1 in the function. This will enable the function to format the data structure and perform detailed checks, including column names, geographic coordinates, species, and other requirements to conform to the CFSTRenD collection standards.
Otherwise, use usage = 2 to perform a reduced checking procedure, which still builds the CFSTRenD structure but skips some of the detailed validations.
out.csv
If users want to export the processed tables in CSV format, specify the output folder here. The default is NULL.
Note: Running the function CFS_format() is the first and mandatory step before using any other functions in the CFSTRenD package. The data provided in this tutorial is already prepared to run the vignette; in practice, users may need to add or modify their own data based on the messages generated by the function.
The data report provides an overview of the quality and characteristics of the tree-ring data at four levels: project, project-species, project-species-site, and project-species-radii, including the quality assessment at the site and radii levels with the default parameters. Quality assessment is described in more detail in the next section.
outfile_data <- tempfile(fileext = ".html")
generate_report(robj = samples69.trt, qa.label_data = "CFS-TRenD V1.2 proj69 ", data_report.reports_sel = c(1,2,3,4), output_file = outfile_data)
robj
The input for the data report is the output of the CFS_format() function, which assigns the class “cfs_format” to the resulting object.
qa.label_data
A short description of the input dataset. This text will appear in the report as the data source for the generated figures.
data_report.reports_sel
This argument specifies the level of data summaries to be included in the reports. Valid options are 1, 2, 3, or 4, each corresponding to one of the four available report types. In this tutorial, we demonstrate only the project–species level summary.
output_file
This argument allows users to export the HTML-formatted report to a specified location by providing a folder and filename (e.g., “path/to/report.html”). If left as NULL (default), the report will not be saved to disk and will instead open directly in the browser for viewing.
This report provides an overview of the tree ring data’s quality and characteristics at four levels:
1. project: Data Completeness: Assessment of missing or incomplete data of the whole data;
2. project-species: Data Summary: Summary statistics and descriptions;
3. project-species-site: data summary tables and series graphing;
4. project-species-site-radii: Correlation Analysis and quality assessment.
project name: Douglas-fir retrospective monitoring
selected reports: 1, 2, 3, 4
This table presents the completeness of each variable of the whole dataset as a percentage. A value of 0 indicates no effective data. Please carefully verify that all required data has been included in the submission.
| var | pct |
|---|---|
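The completeness percentage described above can be illustrated with a few lines of base R. This is only a sketch on a hypothetical metadata table (`meta` and its columns are made up for illustration), not the code used by the report: it computes, for each variable, the share of non-missing values.

```r
# Hypothetical metadata table with some missing values.
meta <- data.frame(
  site_id = c("A", "B", "C", "D"),
  latitude = c(50.1, NA, 49.8, 50.3),
  dbh_cm = c(NA, NA, NA, NA)
)

# Percentage of non-missing values per variable; 0 means no effective data.
completeness <- data.frame(
  var = names(meta),
  pct = unname(round(100 * colMeans(!is.na(meta)), 1))
)
```

Here `dbh_cm` would be flagged with a value of 0, which is the case the report asks users to verify before submission.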
This section presents key summary statistics, including spatial and temporal ranges, summary of ring width measurements, series length, etc., categorized by species.
In this dataset, there’s 1 species: PSEUMEN
*Number of series that passed the test of CFS_qa() on differentiated series
**The values refer to mean ± sd (min, max)
This section presents site-level data summaries, including a figure showing ring width measurements over time and a table with key statistics.
| PSEUMEN | ||||||||
*Ratio between the median of rw of the site and the median of rw of its 10 nearest neighbors.
**The values refer to mean ± sd (min, max)
This table provides a summary for each series, including raw ring width measurements, autocorrelation, correlation with master chronology for both raw and differentiated data, and quality assessment code (qa_code) which was derived from the CFS_qa() function. The master chronology includes all the series with qa_code ‘pass’.
Description of qa_code

| qa_code | Description |
|---|---|
| pass | The maximum correlation occurs at lag 0. |
| borderline | The correlation at lag 0 ranks second highest, and its difference from the maximum is within a predefined threshold; treated as a quasi-pass. |
| pm1 | The maximum correlation occurs at lag 1 or -1, suggesting slight misalignment. |
| highpeak | The maximum correlation occurs at a non-zero lag and is more than twice the second-highest value, potentially signaling an issue. |
| fail | All other series that do not fit into the categories above. |
| PSEUMEN | |||||||||
*developed from raw series
**developed from differentiated series
&correlation with master chronology; the value represents correlation (p-value)
%qa_code is identified using the current data as the reference dataset
Ensuring the quality of data measurements is critical to the success of any data collection effort. In this work, particular attention is given to the possibility of measurement errors and incorrect data transformations, as these can distort tree-ring width values and consequently impact downstream analyses.
Ring-width measurements from certain sites may occasionally appear unusually high or low, often due to transformation errors, as tree-ring data is usually stored as integers with an associated scale factor. The CFS_scale() function addresses this by applying a k-nearest neighbors (k-NN) approach, using geodesic distances on the WGS84 ellipsoid (via the distGeo function from the geosphere package), to identify geographically close sites. It then compares the median tree-ring measurements of the target site to those of its nearest neighbors. This procedure is conducted within the same species.
target_site
The target site refers to a single site to be evaluated and includes at least five columns: species, uid_site, site_id, latitude, and longitude.
ref_sites
The reference sites refer to a dataset of ring-width measurements that includes the target site. In addition to the columns present in the target site, the dataset also contains uid_radius, year, and rw_mm.
scale.label_data_ref
A short description of the reference dataset. This text will appear in the report as the data source for the generated figures.
scale.N_nbs
This specifies the maximum number of neighbors to be considered in the procedure.
scale.max_dist_km
This specifies the maximum distance (in kilometers) for searching neighbors of the target site.
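The neighbor comparison described above can be sketched in base R. This is an illustration under stated assumptions, not the CFS_scale() implementation: it swaps the ellipsoidal distGeo distance for a spherical haversine distance so that no extra package is needed, and the site table, the function `haversine_km`, and all values are hypothetical.

```r
# Great-circle distance in km on a sphere (a simpler stand-in for distGeo).
haversine_km <- function(lat1, lon1, lat2, lon2, r_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r_km * asin(pmin(1, sqrt(a)))
}

# Hypothetical per-site medians of ring width (same species).
sites <- data.frame(
  site_id = c("T", "N1", "N2", "N3", "FAR"),
  latitude = c(50.00, 50.05, 49.95, 50.10, 52.00),
  longitude = c(-120.00, -120.00, -120.05, -119.95, -120.00),
  median_rw = c(15.0, 1.4, 1.6, 1.5, 1.2)
)

target <- sites[sites$site_id == "T", ]
cand <- sites[sites$site_id != "T", ]
cand$dist_km <- haversine_km(target$latitude, target$longitude,
                             cand$latitude, cand$longitude)

# Keep at most n_nbs neighbors within max_dist_km, nearest first.
max_dist_km <- 20
n_nbs <- 10
nbs <- cand[cand$dist_km <= max_dist_km, ]
nbs <- head(nbs[order(nbs$dist_km), ], n_nbs)

# Ratio of the target median to the median of its neighbors.
ratio <- target$median_rw / median(nbs$median_rw)
```

A ratio close to a power of ten, as in this toy example, is the kind of pattern that points to a scale-factor error rather than a real growth difference.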
outfile_scale <- tempfile(fileext = ".html")
generate_report(robj = dt.scale, output_file = outfile_scale)
robj
The input for the scale report is the output of the CFS_scale() function, which assigns the class “cfs_scale” to the resulting object.
This report compares the median ring-width of the target site X003b with the median of its 10 nearest neighbors in the dataset CFSTRenD V1.2-proj69, using a maximum search distance of 20 km.
Tree-ring measurements often exhibit long-term growth trends and interannual variability, which can obscure short-term anomalies and complicate the assessment of data quality (e.g., Bunn 2008; Holmes 1983). Measurement instruments may also introduce inaccuracies due to their limitations. The aim of this exercise is to identify and classify whether ring-width measurements are accurate by using the cross-correlation function (CCF) applied to a treated series (consecutive differences).
dt.input
The input dataset must include at least five columns: species, SampleID, Year, RawRing, and RW_trt. SampleID identifies each series, RawRing contains the raw ring-width measurements in millimeters, and RW_trt contains transformed measurements suitable for constructing a master chronology, according to the user’s choice. In this example, consecutive differences of the series were used for RW_trt.
qa.label_data
A short description of the input dataset. This text will appear in the report as the data source for the generated figures.
qa.label_trt
A short description of the treated series. This text will appear in the report.
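The idea of cross-correlating consecutive differences against a master chronology can be sketched with base R's ccf(). This is a simplified illustration, not the CFS_qa() algorithm: the series are simulated, the helper names `ccf_diff` and `classify_ccf` are hypothetical, the borderline threshold of 0.05 is an assumed value, and the order in which the qa_code rules are tested is an assumption.

```r
# Simulated data: a master chronology, an aligned series, and a series
# carrying the same signal offset by one year.
set.seed(42)
x <- rnorm(81)
master  <- x[1:80]
aligned <- master + rnorm(80, sd = 0.1)
shifted <- x[2:81]

# Cross-correlation of first differences at lags -lag.max..lag.max.
ccf_diff <- function(series, master, lag.max = 5) {
  cc <- ccf(diff(series), diff(master), lag.max = lag.max, plot = FALSE)
  data.frame(lag = as.vector(cc$lag), r = as.vector(cc$acf))
}

# Simplified qa_code rules; threshold and rule order are assumptions.
classify_ccf <- function(cc, borderline_tol = 0.05) {
  ord <- order(cc$r, decreasing = TRUE)
  best <- cc[ord[1], ]
  second <- cc[ord[2], ]
  if (best$lag == 0) return("pass")
  if (second$lag == 0 && best$r - second$r <= borderline_tol) {
    return("borderline")
  }
  if (abs(best$lag) == 1) return("pm1")
  if (best$r > 2 * second$r) return("highpeak")
  "fail"
}

code_aligned <- classify_ccf(ccf_diff(aligned, master))
code_shifted <- classify_ccf(ccf_diff(shifted, master))
```

Differencing matters here because a shared long-term trend in the raw series would inflate correlations at every lag and blur the peak that identifies the correct alignment.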
dt.qa$dt.ccf[qa_code != "pass"]
outfile_qa <- tempfile(fileext = ".html")
generate_report(robj = dt.qa, qa.out_series = c("X003_101_005", "X011_104_003", "X016_104_003"), output_file = outfile_qa)
robj
The input for the cross-dating report is the output of the CFS_qa() function, which assigns the class “cfs_qa” to the resulting object.
qa.out_series
This argument allows users to select which series to examine visually. By default, all series are included, which may result in unnecessarily long processing times.
This report provides an overview of the quality of each series with 4 graphs:
raw ring measurement vs. year;
cross-correlation plot of the raw ring measurement with the master chronology (raw);
treated ring measurement vs. year;
cross-correlation plot of the treated ring measurement with the master chronology (treated). The qa_code is attached to this plot.
Below is the description of qa_code.
Description of qa_code

| qa_code | Description |
|---|---|
| pass | The maximum correlation occurs at lag 0. |
| borderline | The correlation at lag 0 ranks second highest, and its difference from the maximum is within a predefined threshold; treated as a quasi-pass. |
| pm1 | The maximum correlation occurs at lag 1 or -1, suggesting slight misalignment. |
| highpeak | The maximum correlation occurs at a non-zero lag and is more than twice the second-highest value, potentially signaling an issue. |
| fail | All other series that do not fit into the categories above. |
Generalized additive mixed models (GAMMs) handle nonlinear responses and can include random effects (e.g., site or tree identity) to account for hierarchical structures and temporal or spatial dependencies, making them well-suited for modeling complex dendrochronological data. In the CFSTRenD package, a suite of GAM and GAMM models has been implemented to accommodate different types of datasets. Here, we use one model, gamm_spatial, to demonstrate how to generate a model fitting and diagnostic report.
resp_scale
m.candidates
outfile_mod <- tempfile(fileext = ".html")
generate_report(robj = m.sp, output_file = outfile_mod)
robj
The input for the model report is the fitted model object, here m.sp, obtained with gamm_spatial.
This report presents the results of a generalized additive model (GAM) analysis. The objective is to evaluate predictor contributions, describe the functional forms of relationships, and assess the adequacy of the fitted model. The following sections include a model summary, smooth term importance, partial effect visualizations, and diagnostic checks to ensure the robustness of the analysis.
The following output provides the summary of the fitted GAM model. It includes estimated coefficients, smooth terms, approximate significance of predictors, and overall model fit statistics.
#>
#> Family: gaussian
#> Link function: identity
#>
#> Formula:
#> log(bai_cm2) ~ log(ba_cm2_t_1) + s(ageC) + s(FFD) + s(uid_site.fac,
#> bs = "re")
#>
#> Parametric coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -3.36110 0.16585 -20.27 <2e-16 ***
#> log(ba_cm2_t_1) 1.00098 0.02725 36.74 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Approximate significance of smooth terms:
#> edf Ref.df F p-value
#> s(ageC) 7.727 7.727 197.45 <2e-16 ***
#> s(FFD) 8.202 8.202 78.30 <2e-16 ***
#> s(uid_site.fac) 16.529 19.000 8.09 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> R-sq.(adj) = 0.69
#> Scale est. = 0.14552 n = 13272
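A model with the same formula structure as the summary above can be sketched directly with the mgcv package. This is an illustration on simulated data, not the gamm_spatial routine: the report output indicates a gamm-based fit, whereas this sketch uses gam() with a random-effect smooth for site, and every simulated value and coefficient below is made up.

```r
library(mgcv)

# Simulated data with the variable names used in the formula above.
set.seed(1)
n_site <- 8
n_per <- 60
n <- n_site * n_per
d <- data.frame(
  uid_site.fac = factor(rep(seq_len(n_site), each = n_per)),
  ageC = runif(n, 5, 150),
  FFD = runif(n, 60, 160),
  ba_cm2_t_1 = runif(n, 10, 500)
)
site_eff <- rnorm(n_site, sd = 0.3)
d$bai_cm2 <- exp(
  -3 + log(d$ba_cm2_t_1) - 0.01 * d$ageC + 0.002 * d$FFD +
    site_eff[as.integer(d$uid_site.fac)] + rnorm(n, sd = 0.3)
)

# Smooths for age and climate, plus a random intercept per site (bs = "re").
fit <- gam(
  log(bai_cm2) ~ log(ba_cm2_t_1) + s(ageC) + s(FFD) +
    s(uid_site.fac, bs = "re"),
  data = d, method = "REML"
)

fit_summary <- summary(fit)
```

Printing `fit_summary` gives a table laid out like the one above, with parametric coefficients first and the approximate significance of the smooth terms after.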
The relative contribution of predictors is evaluated by calculating the importance percentage of each smooth term, based on the ssq method. This indicates how much each variable contributes to explaining variation in the response.
| Term | Score (%) |
|---|---|
| s(ageC) | 90.8 |
| s(uid_site.fac) | 8.6 |
| s(FFD) | 0.6 |
Partial effect plots illustrate the shape of the relationship between the response and each predictor, while holding other predictors constant. These visualizations help identify nonlinear trends and assess whether effects are monotonic, threshold-like, or more complex.
Diagnostic checks evaluate whether the fitted GAM meets assumptions of independence, normality, and sufficient smoothness. Residuals, k-index, and qq-plots provide evidence of model adequacy or potential overfitting.
#>
#> 'gamm' based fit - care required with interpretation.
#> Checks based on working residuals may be misleading.
#> Basis dimension (k) checking results. Low p-value (k-index<1) may
#> indicate that k is too low, especially if edf is close to k'.
#>
#> k' edf k-index p-value
#> s(ageC) 9.00 7.73 1.00 0.46
#> s(FFD) 9.00 8.20 0.94 <2e-16 ***
#> s(uid_site.fac) 20.00 16.53 NA NA
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1